Notebook:
https://www.kaggle.com/thebrownviking20/topic-modelling-with-spacy-and-scikit-learn
Nirant's latest kernel on spaCy, Hitchhiker's Guide to NLP in spaCy, made me realize that spaCy may be as good as, or even better than, NLTK for natural language processing. My recent kernels deal with deep learning, and I want to extend that work to text data, using spaCy to process and model it.
# Usual imports
import numpy as np
import pandas as pd
from tqdm import tqdm
import string
import matplotlib.pyplot as plt
from sklearn.decomposition import NMF, LatentDirichletAllocation, TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.manifold import TSNE
import concurrent.futures
import time
import pyLDAvis.sklearn
from pylab import bone, pcolor, colorbar, plot, show, rcParams, savefig
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
import os
wine_data = os.path.join(os.getcwd(), "data", "input")  # portable across operating systems
print(os.listdir(wine_data))
# Plotly based imports for visualization
import plotly
from plotly import tools
import plotly.plotly as py
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
plotly.tools.set_credentials_file(username=os.environ['PLOTLY_USERNAME'], api_key=os.environ['PLOTLY_API_KEY'])
# spaCy based imports
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
# Loading data
wines = pd.read_csv(os.path.join(wine_data, 'winemag-data_first150k.csv'))
wines.head()
# Creating a spaCy object
nlp = spacy.load('en_core_web_lg')
spaCy also comes with a built-in named entity visualizer that lets you check your model's predictions in your browser. You can pass in one or more Doc objects and start a web server, export HTML files or view the visualization directly from a Jupyter Notebook.
Named Entity Recognition (NER) is an information extraction task in which named entities in unstructured text are located and classified into predefined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, and percentages.
doc = nlp(wines["description"][3])
spacy.displacy.render(doc, style='ent',jupyter=True)
## Stopwords
from IPython.display import Image
import os
Images = os.path.join(os.getcwd(), "Images")  # avoids invalid "\I" and "\S" escape sequences
Image(filename=os.path.join(Images, 'StopWords.png'))
punctuations = string.punctuation
stopwords = list(STOP_WORDS)
print("Number of stop words in spaCy: ", len(stopwords))
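To make the idea of stop-word removal concrete, here is a minimal sketch in plain Python. The stop-word set `TOY_STOP_WORDS` and the helper `remove_stop_words` are hypothetical names made up for this illustration; spaCy's own `STOP_WORDS` list is much larger, and its tokenizer is far more careful than `str.split`.

```python
import string

# Hypothetical, tiny stop-word list for illustration only;
# spaCy's STOP_WORDS contains several hundred entries.
TOY_STOP_WORDS = {"the", "is", "and", "a", "of", "this"}

def remove_stop_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase each token, strip surrounding punctuation, drop stop words."""
    tokens = (tok.strip(string.punctuation).lower() for tok in text.split())
    return [tok for tok in tokens if tok and tok not in stop_words]

print(remove_stop_words("This wine is ripe and fruity, with a hint of oak."))
```

Only the content-bearing words survive, which is the same effect the stop-word filter in this notebook is meant to have on the reviews.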
Lemmatization is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. Words like "ran" and "running" are converted to "run" so that different forms of the same word are not counted as separate terms in our data.
review = str(" ".join([i.lemma_ for i in doc]))
doc = nlp(review)
spacy.displacy.render(doc, style='ent',jupyter=True)
The sentence looks much different now that it is lemmatized.
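As a rough illustration of what lemmatization does, the sketch below maps inflected forms to a lemma with a small lookup table. The table `TOY_LEMMAS` and the function `toy_lemmatize` are hypothetical and exist only for this example; spaCy derives lemmas from its own rules and language data, not from a hand-written dictionary like this one.

```python
# Hypothetical lookup table for illustration only.
TOY_LEMMAS = {"ran": "run", "running": "run", "runs": "run", "wines": "wine"}

def toy_lemmatize(tokens, lemmas=TOY_LEMMAS):
    """Map each token to its lemma; unknown tokens are just lowercased."""
    return [lemmas.get(tok.lower(), tok.lower()) for tok in tokens]

print(toy_lemmatize(["She", "ran", "while", "he", "was", "running"]))
```

Both "ran" and "running" collapse to "run", so a later word count treats them as one term.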
Part-of-speech (POS) tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context, i.e., its relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
# POS tagging
for i in nlp(review):
    print(i,"=>",i.pos_)
# Parser for reviews
parser = English()
def spacy_tokenizer(sentence):
    # Tokenize the review with the lightweight English pipeline
    mytokens = parser(sentence)
    # Lemmatize and lowercase each token; pronouns keep their lowercased text
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]
    # Drop stop words and punctuation
    mytokens = [ word for word in mytokens if word not in stopwords and word not in punctuations ]
    return " ".join(mytokens)
tqdm.pandas()
wines["processed_description"] = wines["description"].progress_apply(spacy_tokenizer)
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract "topics" that occur in a collection of documents. Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body. Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: "dog" and "bone" will appear more often in documents about dogs, "cat" and "meow" will appear in documents about cats, and "the" and "is" will appear equally in both. A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about cats and 90% about dogs, there would probably be about 9 times more dog words than cat words.
The "topics" produced by topic modeling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows examining a set of documents and discovering, based on the statistics of the words in each, what the topics might be and what each document's balance of topics is. Topic modeling draws on dimensionality-reduction and unsupervised-learning techniques such as LDA, SVD, and autoencoders.
Source: Wikipedia
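The "10% cats, 90% dogs" intuition above can be checked with a small simulation. This is only a sketch under simplifying assumptions: the two topics, their word lists, and the helper `sample_document` are invented for illustration, and each word is drawn by first picking a topic according to the document's mixture and then picking one of that topic's words uniformly.

```python
import random
from collections import Counter

# Hypothetical two-topic vocabulary for illustration.
TOPIC_WORDS = {"dogs": ["dog", "bone"], "cats": ["cat", "meow"]}

def sample_document(topic_weights, n_words, seed=0):
    """Draw words by picking a topic per word, then a word from that topic."""
    rng = random.Random(seed)
    topics = list(topic_weights)
    weights = [topic_weights[t] for t in topics]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topics, weights=weights)[0]
        words.append(rng.choice(TOPIC_WORDS[topic]))
    return words

counts = Counter(sample_document({"dogs": 0.9, "cats": 0.1}, n_words=10_000))
dog_words = counts["dog"] + counts["bone"]
cat_words = counts["cat"] + counts["meow"]
print("dog words per cat word:", dog_words / cat_words)
```

The printed ratio should land close to 9. A topic model works in the opposite direction: it starts from word counts like these and tries to recover the topics and the mixture.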
# Creating a vectorizer
vectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, token_pattern=r'[a-zA-Z\-][a-zA-Z\-]{2,}')
data_vectorized = vectorizer.fit_transform(wines["processed_description"])
NUM_TOPICS = 10
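The `min_df` and `max_df` arguments passed to `CountVectorizer` above prune the vocabulary by document frequency. The sketch below mimics that filtering in plain Python, assuming an integer `min_df` is an absolute document count and a float `max_df` is a proportion of documents; the helper `filter_vocabulary` and the toy reviews are hypothetical.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=2, max_df=0.9):
    """Keep terms appearing in at least min_df docs and at most max_df of all docs."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df <= max_df * n_docs
    )

toy_docs = [
    "cherry oak wine",
    "cherry plum wine",
    "citrus apple wine",
    "citrus oak wine",
]
print(filter_vocabulary(toy_docs))
```

Here "wine" is dropped because it occurs in every review, while "plum" and "apple" are dropped because they occur in only one.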
%%time
# Latent Dirichlet Allocation Model
lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_lda = lda.fit_transform(data_vectorized)
%%time
# Non-Negative Matrix Factorization Model
nmf = NMF(n_components=NUM_TOPICS)
data_nmf = nmf.fit_transform(data_vectorized)
%%time
# Latent Semantic Indexing Model using Truncated SVD
lsi = TruncatedSVD(n_components=NUM_TOPICS)
data_lsi = lsi.fit_transform(data_vectorized)
# Functions for printing keywords for each topic
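def selected_topics(model, vectorizer, top_n=10):
    # Look up the vocabulary once instead of once per keyword.
    # (Newer scikit-learn releases name this method get_feature_names_out().)
    feature_names = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(feature_names[i], topic[i])
               for i in topic.argsort()[:-top_n - 1:-1]])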
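The slice `[:-top_n - 1:-1]` used in `selected_topics` can look cryptic, so here is a plain-Python sketch of the same idea. The helper `top_n_indices` and the example weights are made up for illustration; `sorted(range(...))` stands in for NumPy's `argsort()`.

```python
def top_n_indices(weights, top_n):
    """Indices of the top_n largest weights, largest first."""
    # Indices ordered from smallest to largest weight, like argsort().
    ascending = sorted(range(len(weights)), key=weights.__getitem__)
    # Walk backwards from the end and stop after top_n items.
    return ascending[:-top_n - 1:-1]

topic = [0.1, 2.5, 0.7, 1.9, 0.3]
print(top_n_indices(topic, top_n=3))  # [1, 3, 2]
```

Index 1 carries the largest weight (2.5), followed by index 3 (1.9) and index 2 (0.7), which is the order in which the keywords are printed.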
# Keywords for topics clustered by Latent Dirichlet Allocation
print("LDA Model:")
selected_topics(lda, vectorizer)
# Keywords for topics clustered by Non-Negative Matrix Factorization
print("NMF Model:")
selected_topics(nmf, vectorizer)
# Keywords for topics clustered by Latent Semantic Indexing
print("LSI Model:")
selected_topics(lsi, vectorizer)
# Transforming an individual sentence
text = spacy_tokenizer("Aromas include tropical fruit, broom, brimstone and dried herb. The palate isn't overly expressive, offering unripened apple, citrus and dried sage alongside brisk acidity.")
x = lda.transform(vectorizer.transform([text]))[0]
print(x)
The index in the above list with the largest value represents the most dominant topic for the given review.
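Picking out that index is a one-liner. The sketch below uses a made-up topic distribution `x_example` rather than the real output of `lda.transform`, and the helper name `dominant_topic` is hypothetical.

```python
def dominant_topic(topic_distribution):
    """Return (index, weight) of the largest entry in a topic distribution."""
    best = max(range(len(topic_distribution)), key=lambda i: topic_distribution[i])
    return best, topic_distribution[best]

# Made-up values for illustration; the real x has NUM_TOPICS entries.
x_example = [0.02, 0.05, 0.61, 0.02, 0.30]
print(dominant_topic(x_example))  # (2, 0.61)
```

With a NumPy array such as `x`, `x.argmax()` gives the same index directly.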
pyLDAvis.enable_notebook()
dash = pyLDAvis.sklearn.prepare(lda, data_vectorized, vectorizer, mds='tsne')
dash
1. Topics appear on the left, and their respective keywords on the right.
2. Larger circles mark more frequent topics, and the closer two topics are, the more similar they are.
3. Keywords are selected based on their frequency and on how well they discriminate between topics.
Hover over the topics on the left to get information about their keywords on the right.
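The trade-off between frequency and discriminating power in point 3 can be sketched with the relevance score that, to my understanding, pyLDAvis uses to rank keywords (following Sievert and Shirley's LDAvis): a weighted sum of a word's log probability within the topic and its log lift over the whole corpus. The function `relevance`, the weight `lam`, and the probabilities below are illustrative assumptions, not values taken from this model.

```python
import math

def relevance(p_word_given_topic, p_word, lam=0.6):
    """lam * log p(w|t) + (1 - lam) * log(p(w|t) / p(w))."""
    log_prob = math.log(p_word_given_topic)
    log_lift = math.log(p_word_given_topic / p_word)
    return lam * log_prob + (1 - lam) * log_lift

# Made-up probabilities: "wine" is frequent everywhere,
# "brimstone" is rare overall but concentrated in this topic.
common = {"p_word_given_topic": 0.05, "p_word": 0.05}
specific = {"p_word_given_topic": 0.01, "p_word": 0.001}

for lam in (1.0, 0.0):
    print(lam, relevance(lam=lam, **common), relevance(lam=lam, **specific))
```

With `lam=1.0` the ranking is purely by in-topic frequency, so the common word wins; with `lam=0.0` it is purely by lift, so the topic-specific word wins. The slider in the dashboard moves between these two extremes.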
Next, we reduce the data to 2 components and plot the keywords, so that similarity between keywords shows up as the distance between their markers.
svd_2d = TruncatedSVD(n_components=2)
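# Transpose so that each row (and therefore each marker) is a keyword, matching the feature-name labels used in the plots
data_2d = svd_2d.fit_transform(data_vectorized.T)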
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'markers',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names(),
    hovertext = vectorizer.get_feature_names(),
    hoverinfo = 'text'
)
data = [trace]
iplot(data, filename='scatter-mode')
trace = go.Scattergl(
    x = data_2d[:,0],
    y = data_2d[:,1],
    mode = 'text',
    marker = dict(
        color = '#FFBAD2',
        line = dict(width = 1)
    ),
    text = vectorizer.get_feature_names()
)
data = [trace]
iplot(data, filename='text-scatter-mode')
def spacy_bigram_tokenizer(phrase):
    # Use the full pipeline (nlp): the blank English() parser has no tagger,
    # so pos_ would be empty and no token would ever be tagged as a noun
    doc = nlp(phrase)
    token_not_noun = []
    notnoun_noun_list = []
    noun = ""
    for item in doc:
        if item.pos_ != "NOUN": # separate nouns and not nouns
            token_not_noun.append(item.text)
        if item.pos_ == "NOUN":
            noun = item.text
        # Pair every non-noun seen so far with the most recent noun
        for notnoun in token_not_noun:
            notnoun_noun_list.append(notnoun + " " + noun)
    return " ".join(notnoun_noun_list)
bivectorizer = CountVectorizer(min_df=5, max_df=0.9, stop_words='english', lowercase=True, ngram_range=(1,2))
bigram_vectorized = bivectorizer.fit_transform(wines["processed_description"])
bi_lda = LatentDirichletAllocation(n_components=NUM_TOPICS, max_iter=10, learning_method='online',verbose=True)
data_bi_lda = bi_lda.fit_transform(bigram_vectorized)
print("Bi-LDA Model:")
selected_topics(bi_lda, bivectorizer)
bi_dash = pyLDAvis.sklearn.prepare(bi_lda, bigram_vectorized, bivectorizer, mds='tsne')
bi_dash
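The bigram model above relies on `ngram_range=(1,2)`, which makes the vectorizer count single words and pairs of adjacent words. A minimal plain-Python sketch of that expansion is shown below; the helper `word_ngrams` is hypothetical and splits on whitespace only, whereas `CountVectorizer` also applies its own token pattern and stop-word list.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of text for n in the inclusive ngram_range."""
    tokens = text.split()
    smallest, largest = ngram_range
    grams = []
    for n in range(smallest, largest + 1):
        # Slide a window of n tokens across the text.
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

print(word_ngrams("dried sage acidity"))
```

Each review therefore contributes both its unigrams and its bigrams to the vocabulary, which is why the bigram topics can contain two-word keywords.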
#environment and package versions
print('\n')
print("_"*70)
print('The environment and package versions used in this script are:')
print('\n')
import platform
import sys
import bs4
from bs4 import BeautifulSoup
import urllib.request  # import the submodule explicitly; urllib.request.__version__ is read below
import re
import textacy
import spacy
import gensim
import sklearn
import scipy
import matplotlib
import cufflinks as cf
import IPython
import mglearn
print(platform.platform())
print('Python', sys.version)
print("pandas version:", pd.__version__)
print('OS', os.name)
print('Numpy', np.__version__)
print('Beautiful Soup', bs4.__version__)
print('Urllib', urllib.request.__version__)
print('Regex', re.__version__)
print('Textacy', textacy.__version__)
print('spaCy', spacy.__version__)
print('gensim', gensim.__version__)
print('scikit-learn version', sklearn.__version__)
print('scipy', scipy.__version__)
print('matplotlib', matplotlib.__version__)
print('plotly', plotly.__version__)
print('Cufflinks', cf.__version__)
print("IPython version:", IPython.__version__)
print("mglearn version:", mglearn.__version__)
print("Anaconda Python Environment is: ", os.environ.get('CONDA_DEFAULT_ENV', 'not set'))  # avoids a KeyError outside conda
print('\n')
print("~"*70)
print('\n')